QDB-16881 - Exception when reading empty table with `qdbpd.read_dataframe` #101
Conversation
Pull Request Overview
This PR addresses a ValueError when calling qdbpd.read_dataframe on an existing but empty table by forcing evaluation of the generator and returning an empty DataFrame.
- Modified `read_dataframe` to materialize the generator and handle empty results.
- Added a new test `test_read_dataframe_empty_table_sc16881` to ensure empty tables yield an empty DataFrame.
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/test_pandas.py | Added test_read_dataframe_empty_table_sc16881. |
| quasardb/pandas/__init__.py | Changed `read_dataframe` to `list(...)` the stream and return an empty DataFrame if there are no batches. |
Comments suppressed due to low confidence (1)
tests/test_pandas.py:686
- Consider adding assertions to verify that the empty DataFrame has the expected column names (e.g., `['d']`, or including the timestamp column) to strengthen this test.

```python
assert df.empty
```
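To illustrate the suppressed suggestion with plain pandas (the column names below are hypothetical, not taken from the table under test): a DataFrame built with explicit columns still reports empty, but preserves its schema, which is what the stronger assertion would check.

```python
import pandas as pd

# Hypothetical schema: a "$timestamp" column plus one data column "d".
df = pd.DataFrame(columns=["$timestamp", "d"])

assert df.empty                                  # no rows
assert list(df.columns) == ["$timestamp", "d"]   # schema preserved
```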
quasardb/pandas/__init__.py
Outdated
```python
# we need to evaluate the generator first, and then concatenate if result set is not empty.
dfs = list(stream_dataframe(conn, table, **kwargs))

if len(dfs) == 0:
```
[nitpick] Use `if not dfs:` for a more idiomatic and concise empty-list check in Python.
```diff
-    if len(dfs) == 0:
+    if not dfs:
```
`if len(dfs) == 0:` is a check taken from `pandas.concat`
```python
dfs = list(stream_dataframe(conn, table, **kwargs))

if len(dfs) == 0:
    return pd.DataFrame()
```
Returning an empty DataFrame without any columns may break downstream code expecting the original table schema; consider initializing with the table's column names to preserve schema.
Same behavior as the `qdbpd.query` function.
quasardb/pandas/__init__.py
Outdated
```python
# stream_dataframe is *lazy* - is a generator, not a list.
# if result of stream_dataframe is empty this could result in ValueError on pd.concat()
# we need to evaluate the generator first, and then concatenate if result set is not empty.
```
[nitpick] These three comment lines could be consolidated into a single concise remark to reduce verbosity and improve readability.
```diff
-# stream_dataframe is *lazy* - is a generator, not a list.
-# if result of stream_dataframe is empty this could result in ValueError on pd.concat()
-# we need to evaluate the generator first, and then concatenate if result set is not empty.
+# stream_dataframe is lazy (a generator, not a list). Evaluate it first to avoid ValueError on pd.concat() if the result is empty.
```
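For reference, the ValueError these comments refer to is the one `pd.concat` raises when its input yields no objects; a minimal reproduction with a stand-in empty generator (not the real `stream_dataframe`):

```python
import pandas as pd

def batches():
    # Stand-in for stream_dataframe on an empty table: yields nothing.
    return
    yield  # unreachable; makes this function a generator

try:
    pd.concat(batches())
    raised = False
except ValueError:
    raised = True

assert raised  # pd.concat refuses an empty iterable
```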
quasardb/pandas/__init__.py
Outdated
```python
# stream_dataframe is *lazy* - is a generator, not a list.
# if result of stream_dataframe is empty this could result in ValueError on pd.concat()
# we need to evaluate the generator first, and then concatenate if result set is not empty.
dfs = list(stream_dataframe(conn, table, **kwargs))
```
This pulls everything into memory at once -- will this cause additional memory consumption when used in conjunction with `pd.concat`?
The beautiful thing about using a lazily evaluated generator for `pd.concat` is that we don't need to pull all the individual dataframes into memory before `pd.concat` merges them into a single dataframe.
Assuming that `pd.concat` does not copy dataframes, this does not matter. If it does copy the data, this will suddenly double the required memory.
Could you look into this?
Pondering further, we really only need to know whether there's at least 1 item -- we really don't need to evaluate all of them.
Is there perhaps a more elegant way to inspect lazily with functools or some other generator helper function/library whether there's at least one item?
By default `pd.concat()` will make a copy of the data. We can set the `copy` argument to `False` to override this behavior: https://pandas.pydata.org/docs/reference/api/pandas.concat.html
As for the second comment, I didn't find an easy way to peek a generator:
- we can manually call `next()` once on the generator and check whether it returns the first element or raises a `StopIteration` exception (or we can specify a default value for when the generator reaches its end). If there is no exception, we can use `itertools.chain` to combine the first (already evaluated) element with the rest of the generator and pass this new "chain" to `pd.concat()`: https://stackoverflow.com/questions/661603/how-do-i-know-if-a-generator-is-empty-from-the-start
- just wrap `pd.concat()` in a `try`/`except ValueError` block
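The `next()`/`itertools.chain` option above can be sketched as follows; `concat_stream` and the stand-in generators are illustrative, not part of the quasardb API:

```python
import itertools
import pandas as pd

def concat_stream(stream):
    # Peek exactly one item; the None default avoids StopIteration.
    first = next(stream, None)
    if first is None:
        return pd.DataFrame()
    # Re-attach the consumed element to the rest of the lazy stream.
    return pd.concat(itertools.chain([first], stream))

# Stand-ins for what stream_dataframe(...) might yield:
assert concat_stream(iter(())).empty
assert len(concat_stream(iter([pd.DataFrame({"d": [1]}),
                               pd.DataFrame({"d": [2]})]))) == 2
```

Only one batch is evaluated eagerly here; the remainder of the stream is still consumed lazily by `pd.concat`.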
Ok, this should be fixed:
- it should not make a copy; it should take the dataframes by reference instead (= faster);
- I think try/except on this specific exception is the best approach; it's the simplest.
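A sketch of the agreed try/except approach; `read_dataframe_sketch` and its `stream` argument are stand-ins (the real function takes `conn`, `table`, `**kwargs`), not the merged implementation:

```python
import pandas as pd

def read_dataframe_sketch(stream):
    # `stream` stands in for the lazy stream_dataframe(...) generator;
    # pd.concat consumes it directly, without materializing a list first.
    try:
        return pd.concat(stream)
    except ValueError:
        # Raised by pd.concat when the iterable yields nothing.
        return pd.DataFrame()

assert read_dataframe_sketch(iter(())).empty
assert len(read_dataframe_sketch(iter([pd.DataFrame({"d": [1]})]))) == 1
```

Per the copy discussion above, `pd.concat` also exposes a `copy` parameter in pandas 1.x/2.x that the thread suggests setting to `False`.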
I agree, let's not overcomplicate this.
solatis left a comment
Left a comment; it would be good to clarify that.
Story details: https://app.shortcut.com/quasardb/story/16881